Abstract

The method of maximum entropy has been very successful but there 
are cases where it has either failed or led to paradoxes that have cast doubt 
on its general legitimacy. My more optimistic assessment is that such 
failures and paradoxes provide us with valuable learning opportunities to 
sharpen our skills in the proper way to deploy entropic methods. 

The central theme of this paper revolves around the different ways in 
which constraints are used to capture the information that is relevant to 
a problem. This leads us to focus on four epistemically different types 
of constraints. I propose that the failure to recognize the distinctions 
between them is a prime source of errors. 

I explicitly discuss two examples. One concerns the dangers involved 
in replacing expected values with sample averages. The other revolves 
around misunderstanding ignorance. I discuss the Friedman-Shimony
paradox as it is manifested in the three-sided die problem and also in 
its original thermodynamic formulation. 

1 Introduction 

Our subject is entropic inference. The method of maximum entropy (whether in 
its original MaxEnt version or its generalization ME, the method for updating
probabilities) has been successful in many applications but there are cases where
it has either failed or led to paradoxes. It has been suggested that these are 
symptoms of irreparable flaws. I disagree. My assessment is considerably more 
optimistic: the paradoxes provide us with valuable opportunities for learning 
how to use the method and equally valuable warnings about pitfalls we can and 
should avoid. 

*Presented at MaxEnt 2012, The 32nd International Workshop on Bayesian Inference and
Maximum Entropy Methods in Science and Engineering (July 15-20, 2012, Garching, Germany).



First, some background.^1 The objective of the method of maximum entropy
(ME) is to update from a prior distribution $q$ to a posterior distribution $P$ when
information is given that the posterior $P$ is constrained to belong to a certain
family of distributions: $P \in C = \{p\}$. The selected posterior $P$ is that which
maximizes the (relative) entropy

$$S[p,q] = -\int dx\, p(x) \log \frac{p(x)}{q(x)} \tag{1}$$

subject to the constraint $C$. Justifying the method revolves to a large extent
around justifying the particular choice of the functional $S[p,q]$. The criteria
involved in designing the functional $S[p,q]$ are purely pragmatic:

(1) We seek a method of universal applicability. It is conceivable that different
situations could require different induction methods but what we want
is a general-purpose method that captures what all those other problem-specific
methods have in common.^2

(2) We want a parsimonious method that recognizes the value of information. 
What has been laboriously learned in the past should not be disregarded unless 
rendered obsolete by new information. Priors matter: rational beliefs should be 
updated but only to the minimal extent demanded by the new information. 

(3) The method must be useful in practice. In particular, in order to do sci- 
ence we must be able to understand parts of the universe without having to 
understand the universe as a whole. This implies that the notion of statistical 
independence must play a central and privileged role. This idea - that some 
things can be neglected, that not everything matters - is implemented by im- 
posing a criterion that tells us how to handle independent systems. The design 
criterion we adopt is quite natural: Whenever two systems are a priori believed 
to be independent and we receive information about one it should not matter if 
the other is included in the analysis or not.^3

The subtleties of how these criteria are implemented by imposing locality, coordinate
invariance, and independence, and the proofs showing that they lead to
the entropy functional (1), are discussed in [8]. A noteworthy feature is that it
is not necessary to provide an interpretation for $S[p,q]$, be it in terms of heat,
or disorder, or amounts of information. Entropy is a tool for updating that
requires no further interpretation.

The task of entropic inference - to update rational beliefs when information 

^1 I make no attempt to provide a review of the literature on entropic inference. The following
list, which reflects only some contributions that are directly related to the particular approach
described in this tutorial, is incomplete but might nevertheless be useful: Jaynes [1], Shore
and Johnson [2], Williams [3], Skilling [4], Rodriguez [5], Giffin and Caticha [6]-[8].

^2 This approach to entropic inference includes Jaynes' MaxEnt method and Bayesian inference
as special cases. The MaxEnt method is recovered when the prior $q$ reflects an underlying
measure or (up to a normalization factor) a uniform distribution. Bayesian inference is recovered
when the goal is to infer parameters $\theta$ on the basis of information about data $x$ and the
relation between $x$ and $\theta$ as given by a known likelihood function $q(x|\theta)$. [3][6]

^3 This amounts to requiring that independence be preserved unless information to the
contrary is explicitly introduced, which is in accordance with the principle of parsimony stated
in (2).
becomes available - immediately points both to the question "What is information?"
and to its answer. A fully Bayesian information theory demands a
close relation between information and the beliefs of an ideally rational agent
and, accordingly, the answer is: information is that which affects rational beliefs,
and thus, constraints are information. Which leads us to the main topic
of this paper: how to deploy constraints that correctly capture the information 
that is relevant to a problem. In the next section I argue that we ought to 
distinguish four epistemically different cases and that the failure to recognize 
the distinctions among them is a prime source of mistakes and paradoxes. 

Over the years a number of objections have been raised against the method of 
maximum entropy. I believe some of these objections were quite legitimate at the 
time they were raised. They uncovered conceptual pitfalls with the old MaxEnt 
as it was understood at the time. I also believe that in the intervening decades 
our understanding of entropic inference has evolved to the point that all these 
concerns can now be addressed satisfactorily. I explicitly discuss two examples 
of such mistakes. One concerns the dangers of replacing expected values with 
sample averages (for a good discussion of this so-called "constraint rule" see, e.g., [9]).
The other revolves around misunderstanding ignorance.^4 These are objections
of the type raised by the Friedman-Shimony paradox [10][11] as it is manifested
in the three-sided die problem [5][12] and also in its original thermodynamic
formulation.^5

2 On constraints and relevant information 

To fix ideas consider the standard MaxEnt problem: to assign the probability
of a discrete variable $i$ assuming a uniform underlying measure $q_i = \text{const}$ and
information in the form of a single linear constraint, $\langle f \rangle = F$. MaxEnt requires
us to maximize the Shannon entropy $S[p] = -\sum_i p_i \log p_i$ subject to $\langle f \rangle = F$
and $\sum_i p_i = 1$, which yields $p(i|\lambda) \propto e^{-\lambda f_i}$ for an appropriately chosen $\lambda$.
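The standard problem above can be sketched numerically. The following is a minimal illustration, not taken from the paper: the function name `maxent_distribution`, the bisection bracket, and the tolerance are all assumptions made for the example. It finds the multiplier $\lambda$ for which the canonical form $p_i \propto e^{-\lambda f_i}$ reproduces a prescribed expectation $F$, assuming a uniform measure and a value of $F$ strictly between the smallest and largest $f_i$.

```python
import math

def maxent_distribution(f_values, F, lam_lo=-50.0, lam_hi=50.0, tol=1e-12):
    """Return (lam, p) with p_i proportional to exp(-lam * f_i) and sum_i p_i f_i = F.

    Hypothetical helper: assumes a uniform underlying measure and
    min(f_values) < F < max(f_values).
    """
    def canonical(lam):
        # Shift exponents by their maximum for numerical stability.
        exps = [-lam * f for f in f_values]
        m = max(exps)
        w = [math.exp(e - m) for e in exps]
        z = sum(w)
        return [wi / z for wi in w]

    def expected_f(lam):
        return sum(pi * fi for pi, fi in zip(canonical(lam), f_values))

    # <f> decreases monotonically as lam grows, so bisection is sufficient.
    lo, hi = lam_lo, lam_hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_f(mid) > F:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return lam, canonical(lam)

# Example: a three-valued variable with f_i = 1, 2, 3 and <f> = 2.5.
lam, p = maxent_distribution([1, 2, 3], 2.5)
print(lam, p)
```

Bisection is used only because the expectation is monotonic in $\lambda$; any one-dimensional root finder would serve equally well.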

For example, the canonical distribution that describes the state of thermodynamic
equilibrium is obtained by maximizing $S[p]$ subject to a constraint on the expected
energy $\langle \varepsilon \rangle = E$. This yields the Boltzmann distribution, $p(i|\beta) \propto e^{-\beta \varepsilon_i}$,
where $\beta = 1/T$ is the inverse temperature. The questions that concern us here
are: How do we decide which is the right constraint function $f$ to choose? How
do we decide the numerical value $F$ of its expectation? When can we expect
the inferences to be reliable?

When using the MaxEnt method to obtain, say, the Boltzmann distribution 
it has been common to adopt the following language: 

^4 We use 'ignorance' as a technical term to denote lack of knowledge or lack of information.
The term 'uncertainty' might also be appropriate except that through its heavy use in so many
other contexts it has acquired other connotations that could be misleading. We do not mean
to suggest that a situation of ignorance is in any way morally reprehensible.

^5 For additional references on the controversy around the Friedman-Shimony paradox see
[9]. For yet another paradox that can be handled with the arguments we deploy here see
[12][13]. Other objections raised by these authors, such as the compatibility of Bayesian and
entropic methods, have been fully addressed elsewhere. [6][8]
We seek the probability distribution that codifies the information we actually
have (e.g., the expected energy) and is maximally unbiased (i.e.,
maximally ignorant or maximum entropy) about all the other information
we do not possess.

This justification has stirred considerable controversy. Some of the objections 
that have been raised are the following: 

O1 Objective reality is independent of our subjective knowledge about it. The
observed spectrum of black body radiation is what it is independently of
whatever information happens to be available to us.

O2 In most realistic situations the expected value of the energy is not a quantity
we happen to know. How, then, can we justify using it as information we
actually have?

O3 Even when the expected values of some quantities happen to be known,
there is no guarantee that the resulting inferences will be any good at all.
How can we justify the success of thermodynamics?

These objections deserve our consideration. 

The issue raised by O1 concerns the very essence of what physical theories
are meant to be. Let us grant for the sake of argument that there is such a thing
as an external reality, that real phenomena are what they are independently of
our thoughts about them. Then the issue raised by O1 is whether the purpose
of our theories is to provide models that faithfully mirror this external reality or 
whether the connection to reality is considerably more indirect and the models 
are pragmatic tools for manipulating information about reality for the purposes 
of prediction, control, explanation, etc. In the former case theories mirror reality 
and there is no logical room for subjectivity. In the latter case theories deal with 
our information about reality and some subjective elements are inevitable. The 
evidence in favor of the latter alternative is already considerable. It includes 
the successful derivation and a host of insights into statistical mechanics and 
thermodynamics achieved by Jaynes' MaxEnt. It also includes the more recent 
entropic/Bayesian derivations of both quantum and classical mechanics. 

The objection O1 originates in a failure to recognize that while those epistemic
judgments that must be made when assigning probabilities are inevitably
subjective, they can nevertheless still be objectively right or wrong depending
on whether they achieve empirical success or not. Thus, we can evade O1 by
refusing to wear a straitjacket that forces us into a strict subjective/objective
dichotomy. Subjective judgments do not preclude inferences that are objectively
correct or incorrect.

To address objections O2 and O3 it is useful to distinguish four epistemically
different types of constraints:

(A) We know that the expected value of the function $f$ captures information
that happens to be relevant to the particular problem at hand and we also
know its numerical value, $\langle f \rangle = F$.
The ideal situation is one in which the set of available constraints of type A is
complete in the sense that all the information that is necessary to obtain reliable
answers to the questions that interest us is available. Only then are we guaranteed
reliable predictions.^6 Both requirements of relevance and completeness
are crucial: an incomplete set of type A constraints has predictive value in that
it leads to the best predictions that one can achieve under the circumstances
but there is no guarantee that the predictions will be any good. Thus we see
that, properly understood, objection O3 is not a flaw of the entropic method;
it is a legitimate warning that reasoning with incomplete information is a risky
business.

Note that a particular piece of evidence can be relevant and complete for 
some questions but not for others. For example, the expected energy $\langle \varepsilon \rangle = E$
is both relevant and complete for the question "Will system 1 be in thermal 
equilibrium with another system 2?" or alternatively, "What is the temperature 
of system 1?" But the same expected energy is far from complete for the vast 
majority of other possible questions such as, for example, "Where can we expect 
to find molecule #23 in this sample of ideal gas?" 

(B) We know that $\langle f \rangle$ captures information that happens to be relevant to the
problem at hand but its actual numerical value $F$ is not known.

This is the most common situation in physics. The answer to objection O2
hinges on the observation that whether the value of the expected energy $E$ is
known or not, it is nevertheless still true that maximizing entropy subject to the
energy constraint $\langle \varepsilon \rangle = E$ leads to the objectively correct family of distributions
that describe thermal equilibrium (including, for example, the observed blackbody
spectral distribution). Thus, the justification behind imposing a constraint
on the expected energy is not that the quantity $E$ happens to be known - because
of the brute fact that its value is never actually known - but rather that it is
a quantity that should be known. Even when the actual numerical value $E$
is unknown, the epistemic situation described in case B is one in which we
know that the expected energy $\langle \varepsilon \rangle$ is the relevant information without which no
successful predictions are possible.

The question of how a particular $f$ is singled out as relevant has to be tackled
on a case by case basis. In [8] we discuss the problem of thermal equilibrium
and show that the relevant quantity is indeed the expected energy $\langle \varepsilon \rangle$ and not
some other conserved quantity such as $\langle \varepsilon^2 \rangle$ or $\langle f(\varepsilon) \rangle$.

Type B information is processed by allowing MaxEnt to proceed with the
numerical value of $\langle \varepsilon \rangle = E$ handled as a free parameter. This leads us to the

^6 Our goal here has been merely to describe the epistemically ideal situation one would like
to achieve. The important question of how to assess whether a particular set of constraints is
relevant and complete for any specific issue at hand will not be addressed here except to point
out that the criteria of success are purely pragmatic. More specifically, Jaynes has suggested
that the appropriate criterion is reproducibility: "If any macrophenomenon is found to be
reproducible, then it follows that all microscopic details that were not reproduced, must be
irrelevant for understanding and predicting it. In particular, all circumstances that were not
under the experimenter's control are very likely not to be reproduced, and therefore are very
likely not to be relevant." [14] (See also Section 5.8 of [8].)
correct family of distributions $p(i|\beta) \propto e^{-\beta \varepsilon_i}$ containing the multiplier $\beta$ as a
free parameter. The actual value of the parameter $\beta$ is at this point unknown
and the standard approach is to seek additional information and infer $\beta$ either
directly, by a measurement using a thermometer, or indirectly, by Bayesian
parameter estimation from other empirical data. The additional information
has the net effect of transforming the type B constraint into a type A.

(C) There is nothing special about the function $f$ except that we happen to
know its expected value, $\langle f \rangle = F$. In particular, we do not know whether
information about $\langle f \rangle$ is complete or whether it is at all relevant to the
problem at hand.

We do know something and this information, although limited, has some predictive
value because it serves to constrain our attention to the subset of probability
distributions that agree with it. Maximizing entropy subject to such a
constraint will yield predictions that are the best possible under the circumstances
but since we do not know that $f$ captures information that is relevant
and complete there is absolutely no guarantee that the predictions will be any
good. Induction is risky and objection O3 is a healthy reminder.

(D) We know neither that $\langle f \rangle$ captures relevant information nor do we know
its numerical value $F$.

This is an epistemic situation that reflects complete ignorance. Case D applies
to any arbitrary function $f$ and therefore it applies equally to all functions.
Since no specific $f$ is singled out, a type D constraint provides no information at
all and the correct procedure is to maximize $S[p]$ subject to the single constraint
of normalization. The result is as it should be: extreme ignorance is described
by a uniform distribution.

What distinguishes type C from D is that in C the value of $F$ is actually
known. This fact singles out a specific variable $f$ and justifies using $\langle f \rangle = F$ as a
constraint. What distinguishes D from B is that in B there is actual knowledge
that singles out the variable $f$ as being relevant.

Objection O2 arises from a failure to distinguish constraints of type B from
those of type C and D. 

Summary: Between one extreme of ignorance (type D, we know neither which 
variables are relevant nor their expected values), and the other extreme of use- 
ful knowledge (a complete set of type A constraints in which we know all the 
relevant variables that need to be included in the analysis and we also know 
their expected values), there are intermediate states of knowledge (involving 
constraints of types B and C) and these constitute the rule rather than the 
exception. (It is, of course, also possible to encounter situations that mix con- 
straints of different types.) Type B is the more common and important situation 
in which a relevant variable has been correctly identified even though its actual 
expected value might be unknown. The situation described as type C is less 
common because information about expected values is not usually available. 
(What might be easily available is information in the form of sample averages,
which is not in general quite the same thing; see the next section.) Type D
constraints carry no information at all and should be ignored.

The considerations above also apply to the problem of entropic updating 
from a generic prior $q$ to a posterior by maximizing $S[p,q]$ subject to information
in the form of constraints. 

3 Sample averages are not expected values 

Let us return to the question "If constraints refer to the expectation of certain 
variables, how do we decide their numerical magnitudes?" and explore some 
pitfalls. Here is a common temptation: the numerical values of expectations 
are seldom known and it is tempting to follow the "constraint rule" and replace 
expected values by sample averages because it is the latter that are directly 
available from experiment. But the two are not the same: Sample averages are
experimental data. Expected values are not experimental data. 

For very large samples such a replacement can be justified by the law of large
numbers: there is a high probability that sample averages will approximate
the expected values. However, for small samples using one as an approximation
for the other can lead to incorrect inferences. It is important to realize that these
incorrect inferences do not represent an intrinsic flaw of the entropic method;
they are merely an indication of how the method should not be used.
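
The gap between sample averages and expected values can be illustrated with a small simulation. This is a sketch under stated assumptions, not part of the original argument: it uses a fair three-sided die (expected value 2), and the function name `sample_average_spread`, the number of trials, and the random seed are arbitrary choices. It estimates how widely the sample average of $n$ rolls scatters around the expected value for several sample sizes.

```python
import random
import statistics

def sample_average_spread(n, trials=1000, seed=0):
    """Standard deviation of the sample average of n rolls of a fair three-sided die.

    Hypothetical helper for illustration; the die, trials and seed are assumptions.
    """
    rng = random.Random(seed)
    averages = [
        sum(rng.choice((1, 2, 3)) for _ in range(n)) / n
        for _ in range(trials)
    ]
    return statistics.pstdev(averages)

# The scatter shrinks roughly like 1/sqrt(n): small samples can sit far from <i> = 2.
for n in (5, 50, 500):
    print(n, sample_average_spread(n))
```

For small $n$ the scatter is a sizable fraction of the full range of the variable, which is why treating a sample average as if it were the expected value can mislead.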

Example - just data:

Here is a variation on the same theme. Suppose data $D = \{x_1, x_2, \ldots, x_n\}$
have been collected. How do we process such information? Suppose we do not
have a likelihood function so Bayes rule is not an option. We might be tempted
to maximize $S[p,q]$ subject to a constraint $\langle x \rangle = C_1$ where $C_1$ is unknown
and then try to estimate $C_1$ as a sample average. This is a dangerous move.
The reason is that in the absence of additional information we know neither
that $x$ constitutes relevant information nor do we know its expected value $C_1$
and therefore this is what we identified above as a type D constraint: no
information at all.

The mistake becomes apparent when we realize that if we know the data
$(x_1, \ldots)$ then we also know their squares $(x_1^2, \ldots)$ and their cubes and also any
arbitrary function of them $(f(x_1), \ldots)$. And we also know the corresponding
sample averages. Which of these should we use as an expected value constraint?
Or should we use all of them? The answer is that the entropic method is not designed
to tackle problems where the only information is data $D = \{x_1, x_2, \ldots, x_n\}$.
It is not that it gives a wrong answer; it gives no answer at all because there is
no constraint to impose; the entropic engine cannot even get started.

But there is a possible exception: surely the data $(x_1, \ldots)$ must be relevant
to inferences about the quantity $x$ itself. More generally, whether a given piece
of data turns out to be relevant or not depends on the question being
asked. If we want to make inferences about a particular function $f(x)$ then
we know that information about $\langle f(x) \rangle$ must surely be relevant, in which
case we deal with a type B constraint. Whether the information captured by
$\langle f(x) \rangle$ is sufficient for reliable inferences, or whether higher moment constraints,
$\langle f^2 \rangle, \langle f^3 \rangle, \ldots$ must also be included, is a question that must be addressed on a
case by case basis.

Example - a type B constraint plus data:

Suppose then, that in addition to the data $D = \{x_1, x_2, \ldots, x_n\}$ collected in
$n$ independent experiments we have information described as type B in the
previous section: the expectation $\langle f \rangle$ captures relevant information. Then we
can proceed to maximize the entropy $S[p,q]$, where $q(x_i)$ is a (possibly uniform)
prior distribution, subject to the constraint $\langle f \rangle = F$ where the unknown $F$ is
treated as a free parameter. If the variable $x$ can take $k$ discrete values labeled
by $i$ we let $q(x_i) = q_i$ and $f(x_i) = f_i$ and the result is a canonical distribution

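$$p(x_i|\lambda) = \frac{q_i}{Z}\, e^{-\lambda f_i}, \tag{2}$$

where

$$Z = \sum_{i=1}^{k} q_i\, e^{-\lambda f_i} \quad \text{and} \quad \langle f \rangle = -\frac{\partial \log Z}{\partial \lambda}, \tag{3}$$

with an unknown multiplier $\lambda$ that can be estimated from the data $D$ using
standard Bayesian methods. Assuming the $n$ experiments are independent,
Bayes rule gives

$$p(\lambda|D) = \frac{p(\lambda)}{p(D)} \prod_{j=1}^{n} \frac{q_j}{Z}\, e^{-\lambda f_j}, \tag{4}$$

where $p(\lambda)$ is the prior for $\lambda$ and, inside the product, $q_j$ and $f_j$ denote $q$ and $f$
evaluated at the $j$-th data point $x_j$. It is convenient to consider the logarithm of the
posterior,

$$\log p(\lambda|D) = \log p(\lambda) - \log p(D) - \sum_{j=1}^{n} \left( \log Z - \log q_j + \lambda f_j \right).$$

The value of $\lambda$ that maximizes the posterior $p(\lambda|D)$ is such that

$$\frac{\partial \log p(\lambda|D)}{\partial \lambda} = \frac{\partial \log p(\lambda)}{\partial \lambda} - n \frac{\partial \log Z}{\partial \lambda} - n \bar{f} = 0, \tag{5}$$

where $\bar{f}$ is the sample average,

$$\bar{f} = \frac{1}{n} \sum_{j=1}^{n} f_j. \tag{6}$$

Using (3) we see that the expected value $\langle f \rangle$ (and its corresponding $\lambda$) can be
estimated from the data as

$$\langle f \rangle = \bar{f} - \frac{1}{n} \frac{\partial \log p(\lambda)}{\partial \lambda}. \tag{7}$$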

As $n \to \infty$ the second term on the right hand side vanishes and we see that the
optimal $\lambda$ is such that $\langle f \rangle = \bar{f}$. This is to be expected; as is usual in Bayesian
inference for large $n$ the data $D$ overwhelms the prior $p(\lambda)$ and $\bar{f}$ tends to $\langle f \rangle$
(in probability). But the result eq. (7) also shows that when $n$ is not large then
the prior can make a non-negligible contribution. In general one should not
assume that $\langle f \rangle \approx \bar{f}$.
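
The effect of the prior on the estimate of $\langle f \rangle$ can be made concrete with a short numerical sketch. The paper does not specify a prior for $\lambda$; the Gaussian prior with zero mean and width `sigma` used here is purely an assumption for illustration, as are the function names and the bisection bracket. The code locates the posterior mode by solving the stationarity condition (prior score plus $n(\langle f \rangle - \bar{f})$ equal to zero) and returns the sample average alongside the resulting $\langle f \rangle$.

```python
import math

def expected_f(lam, f_values, q):
    """<f> under the canonical distribution p_i proportional to q_i exp(-lam * f_i)."""
    exps = [-lam * f for f in f_values]
    m = max(exps)
    w = [qi * math.exp(e - m) for qi, e in zip(q, exps)]
    z = sum(w)
    return sum(wi * fi for wi, fi in zip(w, f_values)) / z

def map_expected_f(data_f, f_values, q, sigma=1.0):
    """Posterior-mode estimate of <f>, ASSUMING a Gaussian prior N(0, sigma^2) on lam."""
    n = len(data_f)
    f_bar = sum(data_f) / n

    def score(lam):
        # Derivative of the log posterior: prior term plus n * (<f> - f_bar).
        return -lam / sigma**2 + n * (expected_f(lam, f_values, q) - f_bar)

    lo, hi = -100.0, 100.0  # score is strictly decreasing, so bisect on its sign
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return f_bar, expected_f(lam, f_values, q), lam

uniform = [1.0 / 3.0] * 3
# Same sample average, very different sample sizes (hypothetical data).
print(map_expected_f([3, 3, 2], [1, 2, 3], uniform))
print(map_expected_f([3, 3, 2] * 100, [1, 2, 3], uniform))
```

With three data points the estimate is pulled noticeably away from the sample average toward the value favored by the assumed prior; with three hundred the two nearly coincide.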

Let us emphasize that this analysis holds only when there is additional
knowledge to the effect that the specific variable $f$ captures relevant information.
In the absence of such knowledge we are back to the previous example -
just data - and we have no reason to prefer the function $f(x)$ over any other
function $g(x)$ and accordingly we have no reason to prefer the distribution (2)
over any other canonical distribution $q_i e^{-\lambda g_i}/Z$.^7



4 Confusion about ignorance 

To set the stage for the issues involved consider a three-sided die. Its faces are
labeled by the number of spots $i = 1, 2, 3$ and have probabilities $\theta_1, \theta_2, \theta_3$ which
will be collectively denoted by $\theta$. The space of distributions is the simplex $S_2$
with $\sum_i \theta_i = 1$ shown in the figure. A fair die is one for which $\theta = \theta_C =
(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$, which lies at the very center of the simplex. The expected number of
spots for a fair die is $\langle i \rangle = 2$. Having $\langle i \rangle = 2$ is no guarantee that the die is fair
but if $\langle i \rangle \neq 2$ the die is necessarily biased.

The paradox discussed by Friedman and Shimony [10] and further explored
in [11], [12] and [5] arises from analyzing a situation of complete ignorance in
two ways that are (mistakenly) thought to be equivalent. 

Here is the first way to express complete ignorance: 
Ignorance$_1$ - Nothing is known about the die; we do not know that it is
fair but on the other hand there is nothing that induces us to favor one face
over another. On the basis of this minimal information we can use MaxEnt.
Maximize

$$S(\theta) = -\sum_i \theta_i \log \theta_i \tag{8}$$

subject to $\sum_i \theta_i = 1$ and the resulting maximum entropy distribution is $\theta_{ME} = \theta_C$.

To set the background for the second way to represent complete ignorance
consider constraints of the form $\langle i \rangle = r$ with $1 < r < 3$. In the figure the
constraints corresponding to $r = 2$ and to a generic $r$ are shown as vertical
dashed lines. Maximizing $S(\theta)$ subject to $\langle i \rangle = 2$ and normalization leads us
to assign $\theta_{ME} = \theta_C$ at the center of the simplex. Maximizing $S(\theta)$ subject to
$\langle i \rangle = r$ and normalization leads to the point where the $r$ line crosses the dotted
line. The dotted curve is the set of MaxEnt distributions $\theta_{ME}(r)$ as $r$ spans the
range from 1 to 3.

Here is the second way to express complete ignorance: 

^7 In [15] (pp. 72-75) Jaynes discussed the "constraint rule" by carrying out a similar
MaxEnt/maximum likelihood analysis which failed to include the contribution from the prior for
$\lambda$ [the second term on the right of eq. (7)]. His conclusion that we are always justified to set
$\langle f \rangle = \bar{f}$ is therefore doubly wrong: it is both necessary that $f$ reflect relevant information
and that the correction due to the prior be negligible.
Figure 1: Constraints $\langle i \rangle = r$ are shown as vertical lines on the simplex. The
dotted line is the set of MaxEnt distributions $\theta_{ME}(r)$ as $r$ spans the range from
1 to 3. If $r$ is unknown the average of $\theta_{ME}(r)$ over $r$ leads to the distribution
marked by $\bar{\theta}$.



Ignorance$_2$ - It is tempting (but wrong!) to pursue the following line of
thought. We have a die but we do not know much about it. We do know,
however, that the quantity $\langle i \rangle$ must have some value, call it $r$, about which we
are ignorant too. Now, the most ignorant distribution given $r$ is the MaxEnt
distribution $\theta_{ME}(r)$. But $r$ is itself unknown so a more honest assignment for $\theta$
would be an average over $r$,

$$\bar{\theta} = \int dr\, p(r)\, \theta_{ME}(r), \tag{9}$$

where $p(r)$ reflects our uncertainty about $r$. It may, for example, make sense to
pick a uniform distribution over $r$ but the precise choice is not important for our
purposes. The point is that since the MaxEnt dotted curve is concave the point
$\bar{\theta}$ necessarily lies below $\theta_C$ so that $\bar{\theta}_2 < 1/3$. And we have a paradox: we started
assuming complete ignorance and through a process that claims to express full
ignorance at every step we reach the conclusion that the die is biased against
$i = 2$. Where is the mistake?
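
The claim that the averaged distribution is biased against $i = 2$ can be checked numerically. The sketch below is an illustration under assumptions not fixed by the text: it takes $p(r)$ uniform on $(1, 3)$ and approximates the integral by a midpoint rule; the function names and step count are hypothetical. For the three-sided die the MaxEnt distribution has a closed form: writing $x = e^{-\lambda}$, the constraint $\langle i \rangle = r$ reduces to a quadratic equation in $x$.

```python
import math

def theta_maxent(r):
    """MaxEnt distribution for a three-sided die with <i> = r, for 1 < r < 3."""
    # With x = exp(-lam), theta is proportional to (x, x^2, x^3) and <i> = r gives
    # (3 - r) x^2 + (2 - r) x + (1 - r) = 0; take the positive root.
    disc = (r - 2.0) ** 2 + 4.0 * (3.0 - r) * (r - 1.0)
    x = ((r - 2.0) + math.sqrt(disc)) / (2.0 * (3.0 - r))
    z = 1.0 + x + x * x
    return (1.0 / z, x / z, x * x / z)

def averaged_theta(steps=20000):
    """Midpoint-rule average of theta_maxent(r), ASSUMING p(r) uniform on (1, 3)."""
    total = [0.0, 0.0, 0.0]
    for k in range(steps):
        r = 1.0 + 2.0 * (k + 0.5) / steps
        for j, t in enumerate(theta_maxent(r)):
            total[j] += t / steps
    return tuple(total)

theta_bar = averaged_theta()
print(theta_bar)
print(theta_bar[1] < 1.0 / 3.0)
```

Since $\theta_2 = 1/(x + 1 + 1/x) \le 1/3$ for every $r$, with equality only at $r = 2$, any $p(r)$ that is not concentrated at $r = 2$ gives $\bar{\theta}_2 < 1/3$; the uniform choice is only one instance.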

Another way to expose the problem is to force the issue and impose that the
two descriptions of complete ignorance be equivalent by fiat,

$$\theta_C = \bar{\theta}. \qquad \text{(wrong)}$$

Since the dotted line is concave this equality can only be achieved if $p(r) = \delta(r - 2)$.
And, again, we have a paradox: we started admitting complete ignorance
about the die and therefore also about the value of $r$ and we end with complete
certainty that $r = 2$. Where is the mistake?
Before we blame the entropic method of inference it is best to take a closer
look at Ignorance$_2$. One clue is symmetry. A situation of complete ignorance
ought to treat the outcomes $i = 1, 2, 3$ symmetrically but the end result is a
distribution that is biased against $i = 2$. The symmetry must have been broken
somewhere and it is clear that this happened at the moment we imposed the
constraint on $\langle i \rangle = r$ which is shown as vertical lines on the simplex. Had we
chosen to express our ignorance not in terms of the unknown value of $\langle i \rangle = r$ but
in terms of some other function $\langle f(i) \rangle = s$ then we could have easily broken the
symmetry in some other direction. For example, let $f(i)$ be a cyclic permutation
of $i$,

$$f(1) = 2, \quad f(2) = 3, \quad \text{and} \quad f(3) = 1, \tag{10}$$

then repeating the analysis above would lead us to conclude that $\bar{\theta}_1 < 1/3$,
which represents a die biased against $i = 1$. Thus, the question becomes: What
leads us to choose a constraint on $\langle i \rangle$ rather than a constraint on $\langle f(i) \rangle$ when
we are equally ignorant about both?
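
The dependence of the bias on the arbitrary choice of constraint function can be seen by repeating the averaging for both choices. The following sketch is illustrative only: the helper names, the bisection bracket, and the uniform weighting over the constraint value are assumptions. It averages the MaxEnt distribution once with the constraint on $\langle i \rangle$ and once with the constraint on $\langle f(i) \rangle$ for the cyclic permutation above.

```python
import math

def maxent_for_constraint(f, s):
    """MaxEnt theta_i proportional to exp(-lam * f[i]) with sum_i theta_i f[i] = s."""
    def theta(lam):
        exps = [-lam * fi for fi in f]
        m = max(exps)
        w = [math.exp(e - m) for e in exps]
        z = sum(w)
        return [wi / z for wi in w]

    lo, hi = -60.0, 60.0  # the mean of f decreases with lam, so bisect
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        mean = sum(t * fi for t, fi in zip(theta(mid), f))
        if mean > s:
            lo = mid
        else:
            hi = mid
    return theta(0.5 * (lo + hi))

def averaged_theta(f, steps=1000):
    """Average the MaxEnt distribution over s, ASSUMED uniform on (min f, max f)."""
    a, b = min(f), max(f)
    total = [0.0] * len(f)
    for k in range(steps):
        s = a + (b - a) * (k + 0.5) / steps
        for j, t in enumerate(maxent_for_constraint(f, s)):
            total[j] += t / steps
    return total

print(averaged_theta([1, 2, 3]))  # constraint on <i>: face 2 is disfavored
print(averaged_theta([2, 3, 1]))  # constraint on <f(i)>: face 1 is disfavored
```

The two runs yield the same set of numbers assigned to different faces, which is the broken symmetry the text describes: the bias tracks the choice of function, not anything known about the die.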

The discussion in section 2 is relevant here. The paradox with the three-sided
die arises because a constraint of type D has been treated as if it were a
constraint of type B. The correct approach is to recognize that we do not know
whether it is the constraint $\langle i \rangle$ or any other function $\langle f \rangle$ that captures relevant
information and their numerical values $r$ are also unknown: clearly a type D
constraint. There is nothing to single out $\langle i \rangle$ or any other $\langle f \rangle$ and therefore
the correct inference consists of maximizing $S$ imposing the only constraint we
actually know, namely, normalization. The result is as it should be: a uniform
distribution ($\theta_{ME} = \theta_C$) which agrees with Ignorance$_1$.

On the other hand, the Ignorance$_2$ argument that led to the assignment of $\bar{\theta}$
in eq. (9) and to $\bar{\theta}_2 < 1/3$ would have been correct if we actually had knowledge
that it is the particular variable $\langle i \rangle$ - and not any other $\langle f \rangle$ - that captures
information that is relevant to this very particular die. Thus, imposing the type
B constraint $\langle i \rangle = r$ when $r$ is unknown and then averaging over $r$ represents
a situation in which we know something. There is some ignorance here - we do
not know $r$ - but this is not extreme ignorance.

We can summarize as follows: knowing that the die is biased against $i = 2$
but not knowing by how much (Ignorance$_2$) is not the same as not knowing
anything about the die (Ignorance$_1$).

A thermodynamic version of the paradox is discussed in [11]. Here is the
background: A physical system can be in any of $n$ microstates labeled $i = 1 \ldots n$.
When we know absolutely nothing about the system (Ignorance$_1$) maximizing
entropy subject to the single constraint of normalization leads to a uniform
probability distribution,

$$p_u(i) = 1/n. \tag{11}$$

A different (and wrong!) way to express complete ignorance (Ignorance$_2$)
is to argue that the expected energy $\langle \varepsilon \rangle$ must have some value $E$ about which we
are ignorant. Maximizing entropy subject to both $\langle \varepsilon \rangle = E$ and normalization
leads to the usual Boltzmann distributions,

$$p(i|\beta) = \frac{e^{-\beta \varepsilon_i}}{Z} \quad \text{where} \quad -\frac{\partial \log Z}{\partial \beta} = E. \tag{12}$$

Since the inverse temperature $\beta = \beta(E)$ is itself unknown we must average over $\beta$,

$$\bar{p}(i) = \int d\beta\, p(\beta)\, p(i|\beta). \tag{13}$$

To the extent that both distributions are thought to reflect complete ignorance
we must impose

$$p_u(i) = \bar{p}(i) \qquad \text{(wrong again)}$$

which can be shown (see [11]) to imply that

$$p(\beta) = \delta(\beta) \quad \text{or} \quad \beta = 0. \tag{14}$$

Indeed, setting the Lagrange multiplier $\beta = 0$ in $p(i|\beta)$ leads to the uniform
distribution $p_u(i)$. And now we have a paradox: complete ignorance about
the system (Ignorance$_1$) implies we are ignorant about its temperature. In
fact, the system might not be in thermal equilibrium in which case it may not
even have a temperature at all. But we also have the second way of expressing
ignorance (Ignorance$_2$) and if we impose that the two agree we are led to
conclude that $\beta$ has the value $\beta = 0$ so that the temperature is precisely known.
From complete ignorance we have (wrongly) concluded that the system must
be infinitely hot - confusion about ignorance is hell.

The paradox is dissolved once we realize that, just as with the die problem, 
a type D constraint has been treated as type B. Knowing nothing about a 
system means we do not know whether it is in equilibrium or not. We do not 
know whether it is isolated and whether its energy is conserved and therefore 
it is not clear that $\langle \varepsilon \rangle$ might even be relevant information. This is the kind of
non-information we earlier called a type D constraint that ought to be ignored. 

On the other hand if we were to have actual knowledge that the system is in
thermal equilibrium then it would be legitimate to impose a constraint on the
expected energy $\langle \varepsilon \rangle$: as discussed in [8], thermal equilibrium is the physical
condition that singles out the expected energy $\langle \varepsilon \rangle$ as being the relevant piece of
information. Since the temperature is unknown this is a type B constraint.

To summarize: knowing that a system is in thermal equilibrium while being 
ignorant about its temperature is not the same as knowing nothing about the 
system. 

It may be worthwhile to rephrase this important point in yet another way.
Let $i \in \mathcal{I}$ and $\beta \in \mathcal{B}$ where $\mathcal{I}$ is the space of microstates and $\mathcal{B}$ is the space of
some arbitrary quantities $\beta$. The rules of probability theory allow us to write

$$p(i) = \int d\beta\, p(i, \beta) \quad \text{where} \quad p(i, \beta) = p(\beta)\, p(i|\beta). \tag{15}$$
Paradoxes will easily arise if we fail to distinguish a situation of complete ignorance
from a situation where we have actual knowledge about the conditional
probability $p(i|\beta)$, which is what gives the parameter $\beta$ its meaning as inverse
temperature.

5 Summary 

In entropic inference information is represented by constraints. In this tutorial
we have argued that when constraints are expressed in terms of expected values
the same formal expression $\langle f \rangle = F$ can be used to represent four different
types of available information and failure to distinguish between them can lead
to errors.

In decreasing order of useful information the types range from type A, in
which we know that the quantity $f$ is relevant to the inference at hand and its
expected value $F$ is known; to type B, in which $f$ is known to be relevant but
$F$ is unknown; and type C, in which $f$ is of interest only because $F$ happens to
be known. Constraints of type A and B are common and therefore important;
type C constraints are less so because information about expected values (as
opposed to sample averages) is less directly available.

Finally, there are situations of no information at all, labeled as type D, in
which we know neither that $f$ is relevant nor its expected value $F$. The point
of identifying and labeling the non-informative type D is precisely to avoid
the errors that arise from confusing D with the more informative types A, B,
and C.

The usefulness of distinguishing these four types was illustrated by discussing
the constraint rule and the Friedman-Shimony paradox.

Acknowledgements: I am grateful to Nestor Caticha, R. Fischer, A. Giffin, 
A. Golan, R. Preuss, C. Rodriguez, T. Seidenfeld, J. Skilling and J. Uffink for
many useful discussions on entropic inference. 